Abstract

As data sets grow in dimensionality, non-parametric measures of dependence have seen 
increasing use in data exploration due to their ability to identify non-trivial relationships 
of all kinds. One common use of these tools is to test a null hypothesis of statistical 
independence on all variable pairs in a data set. However, because this approach attempts to 
identify any non-trivial relationship no matter how weak, it is prone to identifying so many 
relationships — even after correction for multiple hypothesis testing — that meaningful 
follow-up of each one is impossible. What is needed is a way of identifying a smaller set of 
“strongest” relationships of all kinds that merit detailed further analysis. 

Here we formally present and characterize equitability, a property of measures of dependence
that aims to overcome this challenge. Notionally, an equitable statistic is a statistic
that, given some measure of noise, assigns similar scores to equally noisy relationships of 
different types (e.g., linear, exponential, etc.) [1]. We begin by formalizing this idea via a 
new object called the interpretable interval, which functions as an interval estimate of the 
amount of noise in a relationship of unknown type. We define an equitable statistic as one 
with small interpretable intervals. 

We then draw on the equivalence of interval estimation and hypothesis testing to show 
that under moderate assumptions an equitable statistic is one that yields well powered 
tests for distinguishing not only between trivial and non-trivial relationships of all kinds 
but also between non-trivial relationships of different strengths, regardless of relationship 
type. This means that equitability allows us to specify a threshold relationship strength 
$x_0$ below which we are uninterested, and to search a data set for relationships of all kinds
with strength greater than $x_0$. Thus, equitability can be thought of as a strengthening of
power against independence that enables fruitful analysis of data sets with a small number 
of strong, interesting relationships and a large number of weaker, less interesting ones. We 
conclude with a demonstration of how our two equivalent characterizations of equitability 
can be used to evaluate the equitability of a statistic in practice. 


1 Introduction 

Suppose we have a data set that we would like to explore to find pairwise associations of interest. 
A commonly taken approach that makes minimal assumptions about the structure in the data 
is to compute a measure of dependence, i.e., a statistic whose population value is non-zero
exactly in cases of statistical dependence, on many candidate pairs of variables. The score of 
each variable pair can be evaluated against a null hypothesis of statistical independence, and 
variable pairs with significant scores can be kept for follow-up [2, 3]. When faced with this 
task, there is a wealth of measures of dependence from which to choose, each with a different 
set of properties [4-13].

While this approach works well in some settings, it is unsuitable in many others due to the 
size of modern data sets. In particular, as data sets grow in dimensionality, the above approach 
often results in lists of significant relationships that are too large to allow for meaningful
follow-up of every identified relationship. For example, in the gene expression data set analyzed in
[14], several measures of dependence reliably identified thousands of significant relationships 
amounting to between 65 and 75 percent of the variable pairs in the data set. Given the 
extensive manual effort that is usually necessary to better understand each of these “hits”, 
further characterizing all of them is impractical. 

A tempting way to deal with this challenge is to rank all the variable pairs in a data set 
according to the test statistic used (or according to p-value) and to examine only a small 
number of pairs with the most extreme values. However, this is a poor idea because, while a 
measure of dependence guarantees non-zero scores to dependent variable pairs, the magnitude 
of these non-zero scores can depend heavily on the type of dependence in question, thereby 
skewing the top of the list toward certain types of relationships over others. For example, if 
some measure of dependence $\hat{\varphi}$ systematically assigns higher scores to, say, linear relationships
than to sinusoidal relationships, then using $\hat{\varphi}$ to rank variable pairs in a large data set could
cause noisy linear relationships in the data set to crowd out strong sinusoidal relationships 
from the top of the list. The natural result would be that the human examining the top-ranked 
relationships would never see the sinusoidal relationships, and they would not be discovered. 

The consistency guarantee of measures of dependence is therefore not strong enough to 
solve the data exploration problem posed here. What is needed is a way not just to identify 
as many relationships of different kinds as possible in a data set, but also to identify a small 
number of strongest relationships of different kinds. 

Here we formally present and characterize equitability, a framework for meeting this goal. 
In previous work, equitability was informally introduced as follows: an equitable measure of 
dependence is one that, given some measure of noise, assigns similar scores to equally noisy 
relationships, regardless of relationship type [1]. In this paper, we formalize this notion in the 
language of estimation theory and tie it to the theory of hypothesis testing. 

Specifically, we define an object called the interpretable interval that functions as an interval 
estimate of the strength of a relationship of unknown type. That is, given a set Q of standard 
relationships on which we have defined a measure of relationship strength, the interpretable 
interval is a range of values that act as good estimates of the true relationship strength $\Phi$ of
a distribution, assuming it belongs to Q. In the same way that a good estimator has narrow 
confidence intervals, an equitable statistic is one that has narrow interpretable intervals. As 
we explain, this property can be viewed as a natural generalization of one of the “fundamental 
properties” described by Rényi in his framework for measures of dependence [15].

We then draw a connection between equitability and statistical power using the equivalence 
between interval estimation and hypothesis testing. This connection shows that whereas typical 
measures of dependence are analyzed in terms of power to distinguish non-trivial associations 
from statistical independence, under moderate assumptions an equitable statistic is one that 
can distinguish finely between relationships of two different strengths that may both be
non-trivial, regardless of the types of the two relationships in question. This result gives us a new
way to understand equitability as a natural strengthening of the requirement of power against 
independence in which we ask that our statistic be useful not just for detecting deviations of 
different types from independence but also for distinguishing strong relationships from weak 
relationships regardless of relationship type. 

Finally, motivated by the connection between equitability and power, we define a new
property, detection threshold, which, at some fixed sample size, is the minimal relationship
strength x such that a statistic’s corresponding independence test has a certain minimal power
on relationships of all kinds with strength at least x. We show that low detection threshold is
strictly weaker than high equitability in that high equitability implies it but the converse does
not hold. Therefore, when equitability is too much to ask, low detection threshold on a broad
set of relationships with respect to an interesting measure of relationship strength may be a
reasonable surrogate goal.

Throughout this paper, we give concrete examples of how our formalism relates to the
analysis of equitability in practice. Indeed, the purpose of the theoretical framework provided
here is to allow for such practical analyses, and so we close with a demonstration of an empirical
analysis of the equitability of several popular measures of dependence.

This paper is accompanied by two companion papers. The first [4] introduces two new 
statistics that aim for good equitability on functional relationships and good power against 
statistical independence, respectively. The second [16] conducts a comprehensive empirical 
analysis of the equitability and power against independence of both of these new methods as
well as several other leading measures of dependence.

The results we present here, in addition to contributing to a better understanding of
equitability, also provide an organizing framework in which to consolidate some of the recent
discussion around equitability. For instance, our formalization of equitability is sufficiently
general to accommodate several of the variants that have arisen in the literature. This allows us
to precisely discuss the definition given by Kinney and Atwal [17, 18] of what, in our theoretical
framework, corresponds to perfect equitability. In particular, our framework allows us to
explain the limitations of an impossibility result presented by Kinney and Atwal about perfect
equitability. Additionally, our framework and the connection it provides to statistical power
also allow us to crystallize and address the concerns about the power against independence of
equitable methods raised by Simon and Tibshirani [19]. (However, empirical questions concerning
the performance of the maximal information coefficient and related statistics are deferred
to the companion papers [4, 16].)

We conclude with a discussion of what situations benefit from using equitability as a
desideratum for data analysis. It is our hope that the theoretical results in this paper will provide
a foundation for further work not only on equitability and methods for achieving equitability,
but also on other possible expansions of our goals for measures of dependence in the setting of 
data exploration or other related settings. 

2 Equitability 

Equitability has been described informally by the authors as the ability of a statistic to “give
similar scores to equally noisy relationships of different types” [1]. Though useful, this informal
definition is imprecise in that it does not specify what is meant by “noisy” or “similar”, and
does not specify for which relationships the stated property should hold. In this section we
provide the formalism necessary to discuss equitability more rigorously.

To do this, we fix a statistic $\hat{\varphi}$ (presumed to be a measure of dependence), a measure of
relationship strength $\Phi$ called the property of interest, and a set Q of standard relationships on
which $\Phi$ is defined. The idea is that Q contains relationships of many different types, and for
any distribution $\mathcal{Z} \in Q$, $\Phi(\mathcal{Z})$ is the way we would ideally quantify the strength of $\mathcal{Z}$ if we had
knowledge of the distribution $\mathcal{Z}$. Our goal is then, given a sample $Z$ of size $n$ from $\mathcal{Z}$, to use
$\hat{\varphi}(Z)$ to draw inferences about $\Phi(\mathcal{Z})$.

Our general approach is to construct a set of intervals, the interpretable intervals of $\hat{\varphi}$ with
respect to $\Phi$, by inverting a certain set of hypothesis tests. We show that these intervals can
be used to turn $\hat{\varphi}(Z)$ into an interval estimate of $\Phi(\mathcal{Z})$, and we call the statistic $\hat{\varphi}$ equitable if
its interpretable intervals are small, i.e., if it yields narrow interval estimates of $\Phi(\mathcal{Z})$.

After constructing the interpretable intervals of $\hat{\varphi}$ with respect to $\Phi$, we demonstrate how
our vocabulary can be used to define a few different concrete instantiations of the concept
of equitability. We do this by using our framework to state several of the notions of, and
results about, equitability that have appeared in the literature, and discussing the relationships
among them. Following this, we provide a short schematic illustration of how the definitions 
we provide would be used to quantitatively evaluate the equitability of a statistic in practice, 
and a discussion of how equitability is related to measurement of effect size more generally. 

In what follows, we keep our exposition generic in order to accommodate variations, both
existing and potential, on the concepts defined here. However, as a motivating example, we
often return to the setting of [1], in which $\hat{\varphi}$ is a statistic like the maximal information coefficient
$\mathrm{MIC}_e$, Q is a set of noisy functional relationships, and $\Phi$ is the coefficient of determination $R^2$
with respect to the generating function. In this setting, the equitability of $\mathrm{MIC}_e$ corresponds
to its utility for constructing narrow interval estimates of the $R^2$ of a relationship that is in Q
but whose specific functional form is unknown.

2.1 Interpretable intervals 

Let $\hat{\varphi}$ be a statistic taking values in $[0,1]$, let Q be a set of distributions, and let $\Phi : Q \to [0,1]$
be some measure of relationship strength. As mentioned previously, we refer to Q as the set
of standard relationships and to $\Phi$ as the property of interest. To construct the interpretable
intervals of $\hat{\varphi}$ with respect to $\Phi$, we must first ask how much $\hat{\varphi}$ can vary when evaluated on
a sample from some $\mathcal{Z} \in Q$ with $\Phi(\mathcal{Z}) = x$. The definition below gives us a way to measure
this. (In this definition and in definitions in the rest of this paper, we implicitly assume a fixed
sample size of $n$.)

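Definition 2.1 (Reliability of a statistic). Let $\hat{\varphi}$ be a statistic taking values in $[0,1]$, and let
$x, \alpha \in [0,1]$. The $\alpha$-reliable interval of $\hat{\varphi}$ at $x$, denoted by $\hat{R}_\alpha^{\hat{\varphi}}(x)$, is the smallest closed interval
$A$ with the property that, for all $\mathcal{Z} \in Q$ with $\Phi(\mathcal{Z}) = x$, we have

$$P\left(\hat{\varphi}(Z) < \min A\right) < \alpha/2 \quad \text{and} \quad P\left(\hat{\varphi}(Z) > \max A\right) < \alpha/2,$$

where $Z$ is a sample of size $n$ from $\mathcal{Z}$.

The statistic $\hat{\varphi}$ is $1/d$-reliable with respect to $\Phi$ on Q at $x$ with probability $1 - \alpha$ if and
only if the diameter of $\hat{R}_\alpha^{\hat{\varphi}}(x)$ is at most $d$.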

See Figure 1a for an illustration. The reliable interval at $x$ is an acceptance region of a
size-$\alpha$ test of the null hypothesis $H_0 : \Phi(\mathcal{Z}) = x$. If there is only one $\mathcal{Z}$ satisfying $\Phi(\mathcal{Z}) = x$,
this amounts to a central interval of the sampling distribution of $\hat{\varphi}$ on $\mathcal{Z}$. If there is more
than one such $\mathcal{Z}$, the reliable interval expands to include the relevant central intervals of the
sampling distributions of $\hat{\varphi}$ on all the distributions $\mathcal{Z}$ in question. For example, when Q is a set
of noisy functional relationships with several different function types and $\Phi$ is $R^2$, the reliable
interval at $x$ is the smallest interval $A$ such that for any functional relationship $\mathcal{Z} \in Q$ with
$R^2(\mathcal{Z}) = x$, $\hat{\varphi}(Z)$ falls in $A$ with high probability over the sample $Z$ of size $n$ from $\mathcal{Z}$.

Because the reliable interval $\hat{R}_\alpha^{\hat{\varphi}}(x)$ can be viewed as the acceptance region of a level-$\alpha$
test of $H_0 : \Phi(\mathcal{Z}) = x$, the equivalence between hypothesis tests and confidence intervals yields
interval estimates of $\Phi$ in terms of $\hat{R}_\alpha^{\hat{\varphi}}(x)$. These intervals are the interpretable intervals,
defined below.

Definition 2.2 (Interpretability of a statistic). Let $\hat{\varphi}$ be a statistic taking values in $[0,1]$, and
let $y, \alpha \in [0,1]$. The $\alpha$-interpretable interval of $\hat{\varphi}$ at $y$, denoted by $\hat{I}_\alpha^{\hat{\varphi}}(y)$, is the smallest closed
interval containing the set

$$\left\{ x \in [0,1] : y \in \hat{R}_\alpha^{\hat{\varphi}}(x) \right\}.$$

The statistic $\hat{\varphi}$ is $1/d$-interpretable with respect to $\Phi$ on Q at $y$ with confidence $1 - \alpha$ if
and only if the diameter of $\hat{I}_\alpha^{\hat{\varphi}}(y)$ is at most $d$.
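
As a rough illustration of Definition 2.2, the following Python sketch inverts a family of reliable intervals on a finite grid. It assumes the reliable intervals have already been estimated at each grid value of x. The function name `interpretable_interval`, the grid, and the toy band of half-width 0.15 are hypothetical choices made for illustration and are not part of the formal development.

```python
import numpy as np

def interpretable_interval(y, x_grid, r_lo, r_hi):
    """Smallest closed interval containing {x in x_grid : r_lo[x] <= y <= r_hi[x]}.

    r_lo[i] and r_hi[i] are the endpoints of the reliable interval at x_grid[i].
    Returns (lo, hi), or None if y lies in no reliable interval on the grid.
    """
    x_grid = np.asarray(x_grid, dtype=float)
    inside = (np.asarray(r_lo) <= y) & (y <= np.asarray(r_hi))
    if not inside.any():
        return None
    xs = x_grid[inside]
    return float(xs.min()), float(xs.max())

# Toy reliable intervals (hypothetical): a band of half-width 0.15 around y = x,
# clipped to [0, 1].
x_grid = np.linspace(0.0, 1.0, 11)
r_lo = np.clip(x_grid - 0.15, 0.0, 1.0)
r_hi = np.clip(x_grid + 0.15, 0.0, 1.0)

lo, hi = interpretable_interval(0.5, x_grid, r_lo, r_hi)
print(lo, hi)  # endpoints of the interpretable interval at y = 0.5 on this grid
```

On a grid, the result is only an approximation of the interval in the definition; a finer grid of x values tightens it.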

See Figure 1a for an illustration. The correspondence between hypothesis tests and
interval estimates [20] gives us the following guarantee about the coverage probability of the
interpretable interval, whose proof we omit.

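Proposition 2.3. Let $\hat{\varphi}$ be a statistic taking values in $[0,1]$, and let $\alpha \in [0,1]$. For all $x \in [0,1]$
and for all $\mathcal{Z} \in Q$,

$$P\left( \Phi(\mathcal{Z}) \in \hat{I}_\alpha^{\hat{\varphi}}\left(\hat{\varphi}(Z)\right) \right) \geq 1 - \alpha,$$

where $Z$ is a sample of size $n$ from $\mathcal{Z}$.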

The definitions just presented have natural non-stochastic counterparts in the large-sample 
limit that we summarize below. 

Definition 2.4 (Reliability and interpretability in the large-sample limit). Let $\varphi : Q \to [0,1]$
be a function of distributions. For $x \in [0,1]$, the smallest closed interval containing the set
$\varphi(\Phi^{-1}(\{x\}))$ is called the reliable interval of $\varphi$ at $x$ and is denoted by $R^{\varphi}(x)$. For $y \in [0,1]$, the
smallest closed interval containing the set $\{x : y \in R^{\varphi}(x)\}$ is called the interpretable interval
of $\varphi$ at $y$ and is denoted by $I^{\varphi}(y)$.

See Figure 1b for an illustration.

2.2 Defining equitability 

Proposition 2.3 implies that if the interpretable intervals of $\hat{\varphi}$ with respect to $\Phi$ are small then
$\hat{\varphi}$ will give good interval estimates of $\Phi$. There are many ways to summarize whether the
interpretable intervals of $\hat{\varphi}$ are small; we focus here on two simple ones.

Figure 1: A schematic illustration of reliable and interpretable intervals. In both figure parts, Q
consists of noisy relationships of three different types depicted in the three different colors. (a) The
relationship between a statistic $\hat{\varphi}$ and $\Phi$ on Q at a finite sample size. The bottom and top boundaries of
each shaded region indicate the $(\alpha/2) \cdot 100\%$ and $(1 - \alpha/2) \cdot 100\%$ percentiles of the sampling distribution
of $\hat{\varphi}$ for each relationship type at various values of $\Phi$. The vertical interval (in black) is the reliable
interval $\hat{R}_\alpha^{\hat{\varphi}}(x)$, and the horizontal interval (in red) is the interpretable interval $\hat{I}_\alpha^{\hat{\varphi}}(y)$. (b) In the
large-sample limit, we replace $\hat{\varphi}$ with a population quantity $\varphi$. The vertical interval (in black) is the reliable
interval $R^{\varphi}(x)$, and the horizontal interval (in red) is the interpretable interval $I^{\varphi}(y)$.


Definition 2.5. The worst-case $\alpha$-reliability (resp. $\alpha$-interpretability) of $\hat{\varphi}$ is $1/d$ if it is $1/d$-reliable
(resp. $1/d$-interpretable) at all $x$ (resp. $y$) $\in [0,1]$. $\hat{\varphi}$ is said to be worst-case $1/d$-reliable
(resp. $1/d$-interpretable) with probability (resp. confidence) $1 - \alpha$.

The average-case $\alpha$-reliability (resp. $\alpha$-interpretability) of $\hat{\varphi}$ is $1/d$ if its reliability (resp.
interpretability), averaged over all $x$ (resp. $y$) $\in [0,1]$, is at least $1/d$. $\hat{\varphi}$ is said to be average-case
$1/d$-reliable (resp. $1/d$-interpretable) with probability (resp. confidence) $1 - \alpha$.
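
The two summaries in Definition 2.5 can be sketched numerically as follows. This is a minimal sketch under stated assumptions: the interval widths are taken as already computed on an evenly spaced grid, and "averaged" is read here as the plain mean of the pointwise scores 1/width over that grid, which is one possible reading rather than the only one. The function names and the example widths are hypothetical.

```python
import math
import numpy as np

def worst_case_score(widths):
    """Reciprocal of the largest interval width (infinite if every width is 0)."""
    widest = float(np.max(widths))
    return math.inf if widest == 0.0 else 1.0 / widest

def average_case_score(widths):
    """Mean over the grid of the pointwise scores 1 / width."""
    w = np.asarray(widths, dtype=float)
    with np.errstate(divide="ignore"):  # a zero width contributes an infinite score
        return float(np.mean(1.0 / w))

# Hypothetical widths of interpretable intervals at three grid values of y.
widths = [0.5, 0.25, 0.5]
print(worst_case_score(widths))    # driven by the widest interval
print(average_case_score(widths))  # mean of 2, 4, and 2
```

The worst-case summary is controlled entirely by the single widest interval, whereas the average-case summary can remain high even when a few intervals are wide.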

(One could imagine more fine-grained ways to summarize reliability/interpretability according
to, for example, some prior over the distributions in Q that reflects a belief about the
importance or prevalence of various types of relationships; for simplicity, we do not pursue this 
here.) 

With this vocabulary, we can now define equitability: average/worst-case equitability is
simply average/worst-case interpretability with respect to some $\Phi$ that reflects relationship
strength. In this paper, we distinguish between interpretability in general and equitability
specifically by using “interpretability” in general statements and “equitability” in contexts
in which $\Phi$ is specifically considered as a measure of relationship strength. Also, we often
use “interpretability” and “equitability” with no qualifier to mean worst-case
interpretability/equitability.

The corresponding definitions of average/worst-case interpretability/reliability can be made
for $\varphi$ in the large-sample limit as well. In that setting, it is possible that all the interpretable
intervals of $\varphi$ with respect to $\Phi$ have size 0; that is, the value of $\varphi(\mathcal{Z})$ uniquely determines the
value of $\Phi(\mathcal{Z})$. In this case, the worst-case reliability/interpretability of $\varphi$ is $\infty$, and $\varphi$ is said
to be perfectly reliable/interpretable, or perfectly equitable depending on context.

Before continuing, let us build intuition by giving two examples of statistics that are perfectly
interpretable in the large-sample limit. First, the mutual information [21, 22] is perfectly
interpretable with respect to the correlation $\rho^2$ on the set Q of bivariate normal random
variables. This is because for bivariate normals we have that $1 - 2^{-2I} = \rho^2$, where $I$ is the
mutual information in bits [23]. Additionally,
Theorem 6 of [24] shows that for bivariate normals distance correlation is a deterministic function
of $\rho^2$ as well. Therefore, distance correlation is also perfectly interpretable and perfectly
reliable with respect to $\rho^2$ on the set of bivariate normals Q.
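
The identity relating mutual information and correlation for bivariate normals can be checked numerically. The sketch below assumes the standard closed-form differential entropies of Gaussian variables and computes the mutual information as h(X) + h(Y) - h(X, Y); the function name `gaussian_mi_bits` and the chosen standard deviations are hypothetical and serve only to show that the result does not depend on the marginal scales.

```python
import math

def gaussian_mi_bits(rho, sx=1.0, sy=1.0):
    """Mutual information (in bits) of a bivariate normal with correlation rho.

    Uses I = h(X) + h(Y) - h(X, Y) with the Gaussian differential entropies.
    """
    two_pi_e = 2.0 * math.pi * math.e
    h_x = 0.5 * math.log2(two_pi_e * sx**2)
    h_y = 0.5 * math.log2(two_pi_e * sy**2)
    det = (sx * sy) ** 2 * (1.0 - rho**2)  # determinant of the covariance matrix
    h_xy = 0.5 * math.log2(two_pi_e**2 * det)
    return h_x + h_y - h_xy

for rho in (0.1, 0.5, 0.9):
    mi = gaussian_mi_bits(rho, sx=2.0, sy=0.5)
    recovered = 1.0 - 2.0 ** (-2.0 * mi)
    print(rho**2, recovered)  # the two columns agree up to rounding error
```

Because the map from mutual information to the squared correlation is one-to-one here, knowing one quantity determines the other, which is what perfect interpretability on this set requires.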

The perfect interpretability with respect to $\rho^2$ on bivariate normals exhibited in both of
these examples is in fact equivalent to one of the “fundamental properties” introduced by
Rényi in his framework for thinking about ideal properties of measures of dependence [15].
This property contains a compromise: it guarantees interpretability that on the one hand is
perfect, but on the other hand applies only on a relatively small set of standard relationships.
One goal of equitability is to give us the tools to relax the “perfect” requirement in exchange
for the ability to make Q a much larger set, e.g., a set of noisy functional relationships. Thus,
equitability can be viewed as a generalization of Rényi’s requirement that allows for a tradeoff
between the precision with which our statistic tells us about $\Phi$ and the set Q on which it does
so.


2.3 Examples of, and results about, equitability

We now give examples, using the vocabulary developed here, of some concrete instantiations of,
and results about, equitability. Our focus here is on functional relationships, as defined below.

Definition 2.6. A random variable distributed over $\mathbb{R}^2$ is called a noisy functional relationship
if and only if it can be written in the form $(X + \varepsilon, f(X) + \varepsilon')$ where $f : [0,1] \to \mathbb{R}$, $X$ is a
random variable distributed over $[0,1]$, and $\varepsilon$ and $\varepsilon'$ are (possibly trivial) random variables. We
denote the set of all noisy functional relationships by $\mathbf{F}$.

2.3.1 Equitability on functional relationships with respect to $R^2$

We can now state one specific type of equitability on functional relationships: equitability with
respect to $R^2$.

Definition 2.7 (Equitability on functional relationships with respect to $R^2$). Let $Q \subset \mathbf{F}$ be
a set of noisy functional relationships. A measure of dependence is $1/d$-equitable on Q with
respect to $R^2$ if it is $1/d$-interpretable with respect to $R^2$ on Q.

We observe that this definition still depends on the set Q in question. The general approach
taken in the literature thus far has been to fix some set F of functions that on the one hand 
is large enough to be representative of relationships encountered in real data sets, but on the 
other hand is small enough to enable empirical analysis, and to make equitability a realistic 
goal. 

As important as the choice of functions to include in F is the choice of marginal distributions
and noise model, both of which are left unspecified in our definition of noisy functional
relationships. In past work, we have examined several possibilities. The simplest is $X \sim \mathrm{Unif}$,
$\varepsilon' \sim \mathcal{N}(0, \sigma^2)$ with $\sigma$ varying, and $\varepsilon = 0$. Slightly more complex noise models include having
$\varepsilon$ and $\varepsilon'$ i.i.d. Gaussians, or having $\varepsilon$ be Gaussian and $\varepsilon' = 0$. More complex marginal
distributions include having $X$ be distributed in a way that depends on the graph of $f$, or having it
be non-stochastic [1, 16]. Given that we often lack a neat description of the noise in real data
sets, we would ideally like a statistic to be highly equitable on as many different such models 
as possible. 
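
For the simplest noise model above, a sample with a prescribed noise level can be generated as in the following sketch. It assumes X uniform on [0, 1], noise only in the second coordinate, and the population relation R^2 = Var(f(X)) / (Var(f(X)) + sigma^2), which is how the coefficient of determination with respect to the generating function is read here. The helper names `sigma_for_r2` and `sample_relationship` and the sinusoidal example function are hypothetical.

```python
import numpy as np

def sigma_for_r2(f, r2, grid_size=100_001):
    """Noise standard deviation giving population R^2 = r2 (requires 0 < r2 <= 1)."""
    x = np.linspace(0.0, 1.0, grid_size)
    var_f = np.var(f(x))  # grid approximation of Var(f(X)) for X ~ Unif[0, 1]
    return float(np.sqrt(var_f * (1.0 - r2) / r2))

def sample_relationship(f, r2, n, rng):
    """Draw n points from (X, f(X) + eps') with eps' ~ N(0, sigma^2) and no noise in X."""
    x = rng.uniform(0.0, 1.0, size=n)
    y = f(x) + rng.normal(0.0, sigma_for_r2(f, r2), size=n)
    return x, y

rng = np.random.default_rng(0)
f = lambda x: np.sin(4 * np.pi * x)  # example function, chosen only for illustration
x, y = sample_relationship(f, r2=0.5, n=200_000, rng=rng)

# Sample R^2 with respect to the generating function; it should be close to 0.5.
empirical_r2 = 1.0 - np.var(y - f(x)) / np.var(y)
print(empirical_r2)
```

The other noise models mentioned in the text (noise in the first coordinate, or in both) would need a different calibration, since noise in X changes the relation between sigma and R^2.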

We can also easily imagine models besides the ones described above: for instance, we might
define $\varepsilon$ and $\varepsilon'$ to be non-Gaussian, we might allow them to depend on each other, or we
might allow their variance to depend on $f(X)$. The importance of such modifications depends
on the context, but our formalism is designed to be flexible enough to handle general models 
that include such variations. 

2.3.2 A setting in which perfect equitability is impossible 

One version of equitability on functional relationships for which perfect equitability has been 
shown to be impossible was introduced by Kinney and Atwal [17]. This version of equitability 
uses as standard relationships the set 

$$Q_K = \left\{ (X, f(X) + \eta) \;\middle|\; f : [0,1] \to [0,1],\ (\eta \perp X) \mid f(X) \right\}$$

with $\eta$ representing a random variable that is conditionally independent of $X$ given $f(X)$. This
model describes functional relationships with noise in the second coordinate only, where that
noise can depend arbitrarily on the value of $f(X)$ but must be otherwise independent of $X$.

Kinney and Atwal prove that no non-trivial measure of dependence can be perfectly worst-case
interpretable with respect to $R^2$ on the set $Q_K$. However, we note here that this result,
while interesting, has two serious limitations. The first limitation, pointed out by Murrell et
al. in the technical comment [25], is that $Q_K$ is extremely large: in particular, the fact that
the noise term $\eta$ can depend arbitrarily on the value of $f(X)$ leads to identifiability issues such
as obtaining the noiseless relationship $f(X) = X^2$ as a noisy version of $f(X) = X$. The more
permissive (i.e. large) a model is, the easier it is to prove an impossibility result for it. Since
$Q_K$ is not contained in the other major models considered in, e.g., [1] and [16], it follows that
this impossibility result does not imply impossibility for any of those models.

The second limitation of Kinney and Atwal’s result is that it only addresses perfect equitability
rather than the more general, approximate notion with which we are primarily concerned.^
While a statistic that is perfectly equitable with respect to $R^2$ may indeed be difficult or even
impossible to achieve for many large models Q including some of the models in [1] and [16], such
impossibility would make approximate equitability no less desirable a property. The question
thus remains how equitable various measures are, both provably and empirically. To borrow
an analogy from computer science, the fact that a problem is proven to be NP-complete does
not mean that we do not want efficient algorithms for the problem; we simply may
have to settle for approximate solutions. Similarly, there is merit in searching for measures of
dependence that appear to be highly equitable with respect to $R^2$ in practice.

For more on this discussion, see the technical comment [18]. 

^ As a matter of record, we wish to clarify a confusion in Kinney and Atwal’s work. They write “The key claim 
made by Reshef et al. in arguing for the use of MIC as a dependence measure has two parts. First, MIC is said 
to satisfy not just the heuristic notion of equitability, but also the mathematical criterion of $R^2$-equitability...”,
with the latter term referring to what we here define as perfect equitability [17]. However, such a claim was never
made in our previous work [1]. Rather, that paper [1] informally defined equitability as an approximate notion
and compared the equitability of MIC, mutual information estimation, and other schemes empirically, concluding 
not that MIC is perfectly equitable but rather that it is the most equitable statistic available in a variety of 
settings. One method can be more equitable than another, even if neither method is perfectly equitable. 



2.4 Quantifying equitability via interpretable intervals 

Let us give a simple demonstration of how the formalism above can be used to empirically 
quantify equitability with respect to $R^2$ on a specific set of noisy functional relationships. We
take as our statistic the sample correlation $\hat{\rho}$. Since this statistic is meant to detect linear
dependencies, we do not expect it to be equitable on a broad class of relationships. In fact it 
is not even a measure of dependence, since its population value can be zero for relationships 
with non-trivial dependence. However, we analyze it here as an instructional example since 
it is widely used and gives intuitive scores. We analyze the equitability of other statistics in 
Section 5. 

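Figure 2a shows an analysis of the equitability with respect to $R^2$ of $\hat{\rho}$ at a sample size of
$n = 500$ on the set

$$Q = \left\{ (X, f(X) + \varepsilon'_\sigma) : X \sim \mathrm{Unif},\ \varepsilon'_\sigma \sim \mathcal{N}(0, \sigma^2),\ f \in F,\ \sigma \in \mathbb{R}_{\geq 0} \right\}$$

where $F$ is a set of 16 functions analyzed in [16]. (See Appendix A.)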

To evaluate the equitability of $\hat{\rho}$ in this context, we generate, for each function $f \in F$ and
for 41 noise levels chosen for each function to correspond to $R^2$ values uniformly spaced in
$[0,1]$, 500 independent samples of size $n = 500$ from the relationship $\mathcal{Z}_{f,\sigma} = (X, f(X) + \varepsilon'_\sigma)$.
We then evaluate $\hat{\rho}$ on each sample to estimate the 5th and 95th percentiles of the sampling
distribution of $\hat{\rho}$ on $\mathcal{Z}_{f,\sigma}$. By taking, for each $\sigma$, the maximal 95th percentile value and the
minimal 5th percentile value across all $f \in F$, we obtain estimates of the 0.1-reliable interval
at each noise level. From the reliable intervals we can then construct interpretable intervals,
and the equitability of $\hat{\rho}$ is the reciprocal of the length of the largest interpretable interval.
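
The simulation step of this procedure can be sketched as follows, at a much smaller scale than the analysis in the text. The sketch assumes two illustrative functions rather than the 16 functions of [16], four noise levels rather than 41, and smaller sample and replicate counts; it also assumes the population relation R^2 = Var(f(X)) / (Var(f(X)) + sigma^2) to choose sigma. The function name `reliable_intervals` is hypothetical, and the inversion into interpretable intervals is omitted.

```python
import numpy as np

def reliable_intervals(funcs, r2_levels, n=100, reps=200, alpha=0.1, seed=0):
    """Monte Carlo estimates of the alpha-reliable interval of the sample correlation.

    For each R^2 level, takes the smallest alpha/2 quantile and the largest
    1 - alpha/2 quantile of the statistic across all functions in funcs.
    """
    rng = np.random.default_rng(seed)
    grid = np.linspace(0.0, 1.0, 10_001)
    lo = np.full(len(r2_levels), np.inf)
    hi = np.full(len(r2_levels), -np.inf)
    for f in funcs:
        var_f = np.var(f(grid))  # grid approximation of Var(f(X)), X ~ Unif[0, 1]
        for j, r2 in enumerate(r2_levels):
            sigma = np.sqrt(var_f * (1.0 - r2) / r2)  # requires 0 < r2 <= 1
            x = rng.uniform(size=(reps, n))
            y = f(x) + rng.normal(0.0, sigma, size=(reps, n))
            # Sample correlation of each replicate (one row per replicate).
            xc = x - x.mean(axis=1, keepdims=True)
            yc = y - y.mean(axis=1, keepdims=True)
            r = (xc * yc).sum(axis=1) / np.sqrt((xc**2).sum(axis=1) * (yc**2).sum(axis=1))
            q_lo, q_hi = np.quantile(r, [alpha / 2, 1 - alpha / 2])
            lo[j] = min(lo[j], q_lo)
            hi[j] = max(hi[j], q_hi)
    return lo, hi

linear = lambda x: x
parabola = lambda x: 4.0 * (x - 0.5) ** 2
r2_levels = [0.25, 0.5, 0.75, 1.0]
lo, hi = reliable_intervals([linear, parabola], r2_levels)
for r2, a, b in zip(r2_levels, lo, hi):
    print(f"R^2 = {r2:.2f}: reliable interval approx. [{a:.2f}, {b:.2f}]")
```

Even with only two functions, the reliable interval at the noiseless level is wide, because the linear relationship has sample correlation near 1 while the symmetric parabola has sample correlation near 0.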

As expected, the interpretable intervals at many values of $\hat{\rho}$ are large. This is because
our set of functions F contains many non-linear functions, and so a given value of $\hat{\rho}$ can be
assigned to relationships of different types with very different $R^2$ values. This is shown by the
pairs of thumbnails in the figure, each of which depicts two relationships with the same $\hat{\rho}$ but
different values of $R^2$. Thus, $\hat{\rho}$ has poor equitability with respect to $R^2$ on this set Q. In
contrast, Figure 2b depicts the way this analysis would look if $\hat{\rho}$ were perfectly equitable: all
the interpretable intervals would have size 0.

2.5 Discussion 

In this section we formalized the notion of equitability via the concepts of reliability and
interpretability. Given a statistic $\hat{\varphi}$ and a measure of relationship strength $\Phi$ defined on some set Q
of standard relationships, we constructed a set of intervals called the interpretable intervals of
$\hat{\varphi}$ with respect to $\Phi$. We constructed the interpretable intervals so they yield interval estimates
of $\Phi$, and we then defined the (worst-case) equitability of $\hat{\varphi}$ to be the inverse of the size of the
largest interpretable interval.

Strictly speaking, equitability simply requires that a natural set of confidence intervals 
obtained from analyzing $\hat{\varphi}$ as an estimator of $\Phi$ be small. However, there is a subtlety here: since
in our setting Q typically contains several different relationship types, there are usually multiple
relationships in Q with a given value of $\Phi$. This is different from the conventional framework
of estimation of a parameter $\theta$, in which we assume that there is exactly one distribution with
any given value of $\theta$, and we must account for this difference in our definitions.

When Q is so small that this subtlety does not arise, equitability becomes a less rich
property. To see this, notice that if there is only one relationship in Q for every value of $\Phi$,





Figure 2: Examples of equitable and non-equitable behavior on a set of noisy functional relationships. 

(a) The equitability with respect to $R^2$ of the Pearson correlation coefficient $\hat{\rho}$ over the set Q of
relationships described in Section 2.4, with $n = 500$. Each shaded region is an estimated 90% central
interval of the sampling distribution of $\hat{\rho}$ for a given relationship at a given $R^2$. The fact that the
interpretable intervals of $\hat{\rho}$ are large indicates that a given $\hat{\rho}$ value could correspond to relationships
with very different $R^2$ values. This is illustrated by the pairs of thumbnails showing relationships with
the same $\hat{\rho}$ but different $R^2$ values. The largest interpretable interval is indicated by a red line. Because
it has width 1, the worst-case equitability with respect to $R^2$ in this case is 1, the lowest possible.

(b) A hypothetical population quantity $\varphi$ that achieves perfect equitability in the large-sample limit.
Here, the value of $\varphi$ for each relationship type depends only on the $R^2$ of the relationship and increases
monotonically with $R^2$. Thus, $\varphi$ can be used as a proxy for $R^2$ on Q with no loss. Thumbnails are
shown for sample relationships that have the same $\varphi$, which corresponds to the fact that they have equal
$R^2$ scores. See Appendix A for a legend of the function types used.


then asymptotic monotonicity of $\hat{\varphi}$ with respect to $\Phi$ is sufficient for perfect equitability in the
large-sample limit. In this scenario, the main obstacle to the equitability of $\hat{\varphi}$ is finite-sample
effects, as with parameter estimation. For example, on the set $\mathcal{Q}$ of bivariate Gaussians, many
measures of dependence are asymptotically perfectly equitable with respect to the correlation.

However, this differs from the motivating data exploration scenario we consider, in which
$\mathcal{Q}$ contains many different relationship types and there are multiple different relationships
corresponding to a given value of $\Phi$. Here, equitability can be hindered either by finite-sample
effects, or by the differences in the asymptotic behavior of $\hat{\varphi}$ on different relationship types in
$\mathcal{Q}$. This is illustrated in Figure 3.

Regardless of the size of $\mathcal{Q}$ though, equitability is fundamentally meant for a situation in
which we cannot simply estimate $\Phi$ directly. (In fact, if $\hat{\varphi}$ is a consistent estimator of $\Phi$ on $\mathcal{Q}$,
it is trivially perfectly equitable in the large-sample limit.) This is because in data exploration
we typically require that $\hat{\varphi}$ be a measure of dependence in order to obtain a minimal robustness
guarantee, and this requirement makes it very difficult to make $\hat{\varphi}$ a consistent estimator of $\Phi$
on a large set $\mathcal{Q}$. For instance, suppose $\mathcal{Q}$ is a set of noisy functional relationships and $\Phi = R^2$.
Here, computing the sample $R^2$ relative to a non-parametric estimate of the generating function
will be asymptotically perfectly equitable. However, this approach is undesirable for data
exploration because of its lack of robustness, as exemplified by the fact that it would assign a
score of zero to, e.g., a circular relationship. Therefore, we are left with the problem of finding

Figure 3: Equitability versus parameter estimation. The left-hand column depicts a scenario in which
$\hat{\theta}$ estimates a parameter $\theta$, each value of which specifies a unique distribution. If the population value of
$\hat{\theta}$ is monotonic in $\theta$, then the confidence intervals shown can be large only due to finite-sample effects.
The right-hand column depicts a scenario in which $\hat{\varphi}$ is being used as an estimate of $\Phi$ but a given value
of $\Phi$ does not uniquely determine the population value of $\hat{\varphi}$: the blue, red, and yellow each represent
distinct sets of distributions in $\mathcal{Q}$ whose members can have identical values of $\Phi$. For instance, they
might correspond to different function types. This is the setting in which we are operating, and the red
intervals on the right are called interpretable intervals. Interpretable intervals can be large either because
of finite-sample effects (as in the conventional estimation case) or because of the lack of interpretability
of the population value of the statistic (shown in the bottom-right picture).


the next-best thing: a measure of dependence $\hat{\varphi}$ whose values have a clear, if approximate,
interpretation in terms of $\Phi$. Equitability supplies us with a way of talking about how well $\hat{\varphi}$
does in this regard.

We close this section with the observation that, though we largely focused here on setting
$\mathcal{Q}$ to be some set of noisy functional relationships, the appropriate definitions of $\mathcal{Q}$ and $\Phi$ may
change from application to application. For instance, instead of functional relationships one
may be interested in relationships supported on one-manifolds, with added noise. Or perhaps
instead of $R^2$ one may decide to focus on the mutual information between the sampled y-values
and the corresponding de-noised y-values [17], or on the fraction of deterministic signal in a
mixture [26]. In each case the overarching goal should be to have $\mathcal{Q}$ be as large as possible
without making it impossible to define an interesting $\Phi$ or making it impossible to find a
measure of dependence that achieves good equitability on $\mathcal{Q}$ with respect to this $\Phi$. Finding
such families $\mathcal{Q}$ and properties $\Phi$ is an important avenue of future work.

3 Equitability and statistical power


In the previous section we defined equitability in terms of interval estimation, and observed that
the interpretable intervals of a statistic $\hat{\varphi}$ with respect to a property of interest $\Phi$ yield interval
estimates of $\Phi$ on a set of distributions $\mathcal{Q}$. Given our construction of interpretable intervals via
inversion of a set of hypothesis tests, it becomes natural to ask whether there is any connection
between equitability and the power of those tests with respect to specific alternatives.

In this section we answer this question by showing that equitability can be equivalently
formulated in terms of power with respect to a family of null hypotheses corresponding to
different relationship strengths. This result re-casts equitability as a strengthening of power
against statistical independence on $\mathcal{Q}$ and gives a second formal definition of equitability that
is easily quantifiable using standard power analysis.

Henceforth, we fix the statistic $\hat{\varphi}$ and then use $R_\alpha(x)$ to denote the $\alpha$-reliable interval of $\hat{\varphi}$
at $x \in [0,1]$ and $I_\alpha(y)$ to denote the $\alpha$-interpretable interval of $\hat{\varphi}$ at $y \in [0,1]$.

3.1 Intuition 

Before stating and proving the relationship between equitability and power, let us first build
some intuition for why it should hold. We begin by recalling that the reliable interval $R_\alpha(x_0)$
is an acceptance region of a two-sided level-$\alpha$ test of $H_0 : \Phi(\mathcal{Z}) = x_0$. Since the interval
estimates obtained by inverting this test are the interpretable intervals of $\hat{\varphi}$, it makes sense to
ask whether there is any property of these hypothesis tests that improves as the interpretability
of the statistic $\hat{\varphi}$ increases. To see why the relevant property is power, let us consider the
following illustrative question: what is the minimal $x_1 > 0$ such that a right-tailed^ level-$\alpha$ test
of $H_0 : \Phi = 0$ will have power at least $1 - \beta$ on $H_1 : \Phi = x_1$? As shown graphically in Figure 4,
the answer can be stated in terms of the reliable and interpretable intervals of $\hat{\varphi}$.

Specifically, if $t_\alpha$ is the maximal element of $R_{2\alpha}(0)$, then the minimal value of $\Phi$ at which
a right-tailed test based on $\hat{\varphi}$ will achieve power $1 - \beta$ is $\Phi = \max I_{2\beta}(t_\alpha)$, i.e., the maximal
element of the $2\beta$-interpretable interval at $t_\alpha$. So if the statistic is highly interpretable at $t_\alpha$ then
we will be able to achieve high power against very small departures from the null hypothesis of
independence. That is, good interpretability on $\mathcal{Q}$ implies good power against independence
on $\mathcal{Q}$. It turns out that this reasoning holds in general and in both directions, as we establish
below.

3.2 Definitions 

To be able to state our main result, we need to formally describe how equitability would be 
formulated in terms of power. This requires two definitions. The first is a definition of a 
power function that parametrizes the space of possible alternative hypotheses specifically by 
the property of interest. The second is a definition of a property of this power function called 
its uncertain set. It will turn out later that uncertain sets are interpretable intervals
and vice versa.

^ We consider a one-sided test here, and henceforth in this section. The reason is that in practice, when
$\Phi$ corresponds to relationship strength, we are interested in rejecting a null hypothesis representing weaker
relationships. In such a situation, it is more common to perform a one-sided test. Nevertheless, results similar
to those shown in this section can be derived for two-sided tests as well.

[Figure 4 graphic; labels: $t_\alpha = \max R_{2\alpha}(0)$, $R_{2\alpha}(0)$, $\max I_{2\beta}(t_\alpha)$]

Figure 4: An illustration of the connection between equitability and power. In this example, we ask
for the minimal $x > 0$ that allows a right-tailed test based on $\hat{\varphi}$ to achieve power $1 - \beta$ in distinguishing
between $H_0 : \Phi = 0$ and $H_1 : \Phi = x$. The optimal critical value of such a test, denoted by $t_\alpha$, can be
shown to be the maximal element of the reliable interval $R_{2\alpha}(0)$, and the required $x$ can be shown to be
the maximal element of the interpretable interval $I_{2\beta}(t_\alpha)$, provided $\max R_\alpha(\cdot)$ is an increasing function.
(The reliable and interpretable intervals pictured are for the case that $\alpha = \beta$.)



As before, let $\hat{\varphi}$ be a statistic, let $\mathcal{Q}$ be a set of standard relationships, and let $\Phi : \mathcal{Q} \to [0,1]$
be a property of interest defined on $\mathcal{Q}$. Given a set of right-tailed tests based on the same test
statistic, we refer to the one with the smallest critical value as the most permissive test.

Definition 3.1. Fix $\alpha, x_0 \in [0,1]$, and let $T^{x_0}_\alpha$ be the most permissive level-$\alpha$ right-tailed test
based on $\hat{\varphi}$ of the (possibly composite) null hypothesis $H_0 : \Phi(\mathcal{Z}) = x_0$. For $x_1 \in [0,1]$, define

$$K^{x_0}_\alpha(x_1) = \inf_{\mathcal{Z} \,:\, \Phi(\mathcal{Z}) = x_1} \mathbf{P}\left(T^{x_0}_\alpha(Z) \text{ rejects}\right)$$

where $Z$ is a sample of size $n$ from $\mathcal{Z}$. That is, $K^{x_0}_\alpha(x_1)$ is the power of $T^{x_0}_\alpha$ with respect to the
composite alternative hypothesis $H_1 : \Phi = x_1$.

We call the function $K^{x_0}_\alpha : [0,1] \to [0,1]$ the level-$\alpha$ power function associated to $\hat{\varphi}$ at $x_0$
with respect to $\Phi$.

Note that in the above definition our null and alternative hypotheses may be composite
since they are based on $\Phi$ and not on a complete parametrization of $\mathcal{Q}$. That is, $\mathcal{Z}$ can be one
of several distributions with $\Phi(\mathcal{Z}) = x_0$ or $\Phi(\mathcal{Z}) = x_1$ respectively.
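As a rough illustration of Definition 3.1, the sketch below estimates a power function by simulation. Everything in it is an assumption made for the example rather than part of the analysis in this paper: the statistic is the squared Pearson correlation, the toy family contains only a linear and a parabolic function on an even grid in [0, 1], noise is added to the y-coordinate only, and the names `statistic_draws` and `power_function` are hypothetical. The critical value is taken as the largest empirical $(1 - \alpha)$ quantile over the relationship types in the null, and the power is the smallest rejection rate over the types in the alternative.

```python
import math
import random

def standardized_design(f, n):
    """Return x-values on an even grid in [0, 1] and f(x) rescaled to mean 0, variance 1."""
    xs = [(i + 0.5) / n for i in range(n)]
    fx = [f(x) for x in xs]
    mean = sum(fx) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in fx) / n)
    return xs, [(v - mean) / sd for v in fx]

def squared_pearson(xs, ys):
    """Squared sample Pearson correlation, used here as the test statistic."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy * sxy / (sxx * syy)

def statistic_draws(f, r2, n, trials, rng):
    """Simulate the statistic on samples of size n from y = f(x) + noise with R^2 = r2."""
    xs, g = standardized_design(f, n)
    draws = []
    for _ in range(trials):
        # Signal variance r2 plus noise variance 1 - r2 gives population R^2 = r2.
        ys = [math.sqrt(r2) * gi + math.sqrt(1.0 - r2) * rng.gauss(0.0, 1.0) for gi in g]
        draws.append(squared_pearson(xs, ys))
    return draws

def power_function(functions, x0, x1_values, alpha, n=100, trials=200, seed=0):
    """Monte Carlo estimate of the level-alpha power function at x0 (hypothetical helper)."""
    rng = random.Random(seed)
    # Critical value: the largest (1 - alpha) quantile over all relationship types in the null.
    k = min(trials - 1, math.ceil((1.0 - alpha) * trials) - 1)
    crit = max(sorted(statistic_draws(f, x0, n, trials, rng))[k] for f in functions)
    # Power: the worst case (infimum) over relationship types in the alternative.
    power = {}
    for x1 in x1_values:
        rates = []
        for f in functions:
            draws = statistic_draws(f, x1, n, trials, rng)
            rates.append(sum(d > crit for d in draws) / trials)
        power[x1] = min(rates)
    return crit, power

linear = lambda x: x
parabolic = lambda x: (x - 0.5) ** 2

_, k_linear = power_function([linear], 0.0, [0.9], alpha=0.05)
_, k_mixed = power_function([linear, parabolic], 0.0, [0.9], alpha=0.05)
```

In this toy setting the composite alternative matters: with only the linear function the estimated power at $x_1 = 0.9$ is high, whereas adding the parabola, to which the Pearson correlation is nearly blind on a symmetric grid, drives the infimum toward zero.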

Under the assumption that $\Phi(\mathcal{Z}) = 0$ if and only if $\mathcal{Z}$ represents statistical independence,
the power function $K^{0}_\alpha$ gives the power of optimal level-$\alpha$ right-tailed tests based on $\hat{\varphi}$ at
distinguishing various non-zero values of $\Phi$ from statistical independence across the different
relationship types in $\mathcal{Q}$. One way to view the main result of this section is that the set of
power functions at values of $x_0$ besides 0 contains much more information than just the power
of right-tailed tests based on $\hat{\varphi}$ against the null hypothesis of $\Phi = 0$, and that this information
can be equivalently viewed in terms of interpretable intervals. Specifically, we can recover the
interpretability of $\hat{\varphi}$ at every $y \in [0,1]$ by considering its power functions at values of $x_0$ beyond
0.

Let us now define the precise aspect of the power functions associated to p that will allow 
us to do this. 

Definition 3.2. The uncertain set of a power function $K^{x_0}_\alpha$ is the set
$\{x_1 \geq x_0 : K^{x_0}_\alpha(x_1) < 1 - \alpha\}$.
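To make Definition 3.2 concrete, the following sketch reads an uncertain set off a power function tabulated on a grid. The power values are a made-up toy curve (linear in $x_1$ and capped at 1), not output from any real statistic, and the function name `uncertain_set` is hypothetical.

```python
def uncertain_set(grid, power, x0, alpha):
    """Grid points x1 >= x0 at which the power stays below 1 - alpha."""
    return [x1 for x1, k in zip(grid, power) if x1 >= x0 and k < 1.0 - alpha]

# Toy power function: starts at the level alpha at x0 and rises linearly, capped at 1.
alpha, x0 = 0.05, 0.2
grid = [i / 50 for i in range(51)]            # 0.00, 0.02, ..., 1.00
power = [min(1.0, max(0.0, alpha + 2.0 * (x - x0))) for x in grid]

u = uncertain_set(grid, power, x0, alpha)
width = max(u) - min(u)                       # about 0.44 for this toy curve
```

On a grid the result only approximates the true set: the toy curve crosses $1 - \alpha$ at $x_1 = 0.65$, so the last grid point retained is 0.64.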







The main result of this section will be that uncertain sets are interpretable intervals and 
vice versa. 

3.3 Preliminary lemmas 

Our proof of the alternate characterization of equitability in terms of power requires two short 
lemmas. The first shows a connection between the maximum element of a reliable interval and 
the minimal element of an interpretable interval, namely that these two operations are inverses 
of each other. 

Lemma 3.3. Given a statistic $\hat{\varphi}$, a property of interest $\Phi$, and some $\alpha \in [0,1]$, define
$f(x) = \max R_\alpha(x)$ and $g(y) = \min I_\alpha(y)$. If $f$ is strictly increasing, then $f$ and $g$ are inverses of each
other.

Proof. Let $y = f(x) = \max R_\alpha(x)$. We know that $\min I_\alpha(y) \leq x$, for if it were greater than
$x$ then we would have that $x \notin I_\alpha(y)$, which would imply that $y \notin R_\alpha(x)$, contradicting the
definition of $y$. On the other hand, we cannot have $\min I_\alpha(y) < x$, because this would imply
that there is some $x' < x$ such that $y \in R_\alpha(x')$, meaning that $\max R_\alpha(x') \geq y = \max R_\alpha(x)$,
which contradicts the fact that $f$ is strictly increasing. □
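Lemma 3.3 can be checked numerically on a toy example. The sketch below assumes, as the proof above uses, that $x$ belongs to the interpretable interval at $y$ whenever $y$ lies in the reliable interval at $x$. The reliable intervals are an invented family of constant half-width 0.1 (not clipped to $[0,1]$), chosen only so that $f(x) = \max R(x)$ is strictly increasing; the helper names are hypothetical.

```python
def interpretable_interval(y, grid, reliable):
    """Smallest interval containing every grid point x whose reliable interval contains y."""
    hits = [x for x in grid if reliable(x)[0] <= y <= reliable(x)[1]]
    return (min(hits), max(hits)) if hits else None

# Toy reliable intervals of constant half-width 0.1 around the identity, so that
# f(x) = max R(x) = x + 0.1 is strictly increasing in x.
def reliable(x):
    return (x - 0.1, x + 0.1)

grid = [i / 100 for i in range(101)]
f = lambda x: reliable(x)[1]                                  # f(x) = max R(x)
g = lambda y: interpretable_interval(y, grid, reliable)[0]    # g(y) = min I(y)

round_trip = [g(f(x)) for x in grid]                          # recovers each grid point
```

Restricting $x$ to a grid is an approximation; with a finer grid the same round trip holds at more points.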

The second lemma gives the connection between reliable intervals and hypothesis testing 
that we will exploit in our proof. 

Lemma 3.4. Fix a statistic $\hat{\varphi}$, a property of interest $\Phi$, and some $\alpha, x_0 \in [0,1]$. The most
permissive level-$(\alpha/2)$ right-tailed test based on $\hat{\varphi}$ of the null hypothesis $H_0 : \Phi(\mathcal{Z}) = x_0$ has
critical value $\max R_\alpha(x_0)$.

Proof. We seek the smallest critical value that yields a level-$(\alpha/2)$ test. This would be the
supremum, over all $\mathcal{Z}$ with $\Phi(\mathcal{Z}) = x_0$, of the $(1 - \alpha/2) \cdot 100\%$ value of the sampling distribution
of $\hat{\varphi}$ when applied to $\mathcal{Z}$. By definition this is $\max R_\alpha(x_0)$. □

3.4 Proving the main result: equitability in terms of statistical power 

We are now ready to prove our main result, which is the following equivalent characterization 
of equitability in terms of statistical power. 

Theorem 3.5. Fix a set $\mathcal{Q} \subset \mathcal{P}$, a function $\Phi : \mathcal{Q} \to [0,1]$, and $0 < \alpha < 1/2$. Let $\hat{\varphi}$ be a
statistic with the property that $\max R_{2\alpha}(x)$ is a strictly increasing function of $x$. Then for all
$d > 0$, the following are equivalent.

1. $\hat{\varphi}$ is worst-case $1/d$-interpretable with respect to $\Phi$ with confidence $1 - 2\alpha$.

2. For every $x_0, x_1 \in [0,1]$ satisfying $x_1 - x_0 > d$, there exists a level-$\alpha$ right-tailed test based
on $\hat{\varphi}$ that can distinguish between $H_0 : \Phi(\mathcal{Z}) \leq x_0$ and $H_1 : \Phi(\mathcal{Z}) \geq x_1$ with power at
least $1 - \alpha$.

Theorem 3.5 can be seen to follow from the proposition below. 

Proposition 3.6. Fix $0 < \alpha < 1$ and $d > 0$, and suppose $\hat{\varphi}$ is a statistic with the property that
$\max R_\alpha(x)$ is a strictly increasing function of $x$. Then for $y \in [0,1]$, the interval $I_\alpha(y)$ equals
the closure of the uncertain set of $K^{x_0}_{\alpha/2}$ for $x_0 = \min I_\alpha(y)$. Equivalently, for $x_0 \in [0,1]$, the
closure of the uncertain set of $K^{x_0}_{\alpha/2}$ equals $I_\alpha(y)$ for $y = \max R_\alpha(x_0)$.

Figure 5: The relationship between equitability and power, as in Proposition 3.6. The top plot is the
same as the one in Figure 1a, with the indicated interval denoting the interpretable interval $I_\alpha(y)$. The
bottom plot is a plot of the power function $K^{x_0}_{\alpha/2}(x)$, with the y-axis indicating statistical power. The
key to the proof of the proposition is to notice that the width of the interpretable interval describes the
distance from $x_0$ to the point at which the power function reaches $1 - \alpha/2$, and this is exactly the width
of the uncertain set of the power function. (Notice that because the null and alternative hypotheses are
composite, $K^{x_0}_{\alpha/2}(x_0)$ need not equal $\alpha/2$; in general it may be lower.)


An illustration of this proposition and its proof is shown in Figure 5. 

Proof. The equivalence of the two statements follows from Lemma 3.3, which states that
$y = \max R_\alpha(x_0)$ if and only if $x_0 = \min I_\alpha(y)$. We therefore prove only the first statement, namely
that $I_\alpha(y)$ is the closure of the uncertain set of $K^{x_0}_{\alpha/2}$ for $x_0 = \min I_\alpha(y)$.

Let $U$ be the uncertain set of $K^{x_0}_{\alpha/2}$. We prove the claim by showing first that
$\inf U = \min I_\alpha(y)$, and then that $\sup U = \max I_\alpha(y)$.

To see that $\inf U = \min I_\alpha(y)$, we simply observe that because $\alpha/2 < 1/2$, we have
$K^{x_0}_{\alpha/2}(x_0) \leq \alpha/2 < 1 - \alpha/2$, which means that $U$ is non-empty, and so by construction its
infimum is $x_0$, which we have assumed equals $\min I_\alpha(y)$.

Let us now show that $\sup U \geq \max I_\alpha(y)$: by the definition of the interpretable interval,
we can find $x$ arbitrarily close to $\max I_\alpha(y)$ from below such that $y \in R_\alpha(x)$. But this means
that there exists some $\mathcal{Z}$ with $\Phi(\mathcal{Z}) = x$ such that if $Z$ is a sample of size $n$ from $\mathcal{Z}$ then

$$\mathbf{P}\left(\hat{\varphi}(Z) < y\right) > \frac{\alpha}{2},$$

i.e.,

$$\mathbf{P}\left(\hat{\varphi}(Z) \geq y\right) < 1 - \frac{\alpha}{2}.$$

But since, as we already noted, $y = \max R_\alpha(x_0)$, Lemma 3.4 tells us that it is the critical value
of the most permissive level-$(\alpha/2)$ right-tailed test of $H_0 : \Phi(\mathcal{Z}) = x_0$. Therefore,
$K^{x_0}_{\alpha/2}(x) < 1 - \alpha/2$, meaning that $x \in U$.

It remains only to show that $\sup U \leq \max I_\alpha(y)$. To do so, we note that $y \notin R_\alpha(x)$ for all
$x > \max I_\alpha(y)$. This implies that either $y > \max R_\alpha(x)$ or $y < \min R_\alpha(x)$. However, since
$y \in R_\alpha(x_0)$ and $\max R_\alpha(\cdot)$ is an increasing function, no $x > x_0$ can have $y > \max R_\alpha(x)$. Thus
the only option remaining is that $y < \min R_\alpha(x)$. This means that if $Z$ is a sample of size $n$
from any $\mathcal{Z}$ with $\Phi(\mathcal{Z}) = x > \max I_\alpha(y)$, then

$$\mathbf{P}\left(\hat{\varphi}(Z) < y\right) < \frac{\alpha}{2},$$

i.e.,

$$\mathbf{P}\left(\hat{\varphi}(Z) \geq y\right) > 1 - \frac{\alpha}{2}.$$

As above, this implies that $K^{x_0}_{\alpha/2}(x) \geq 1 - \alpha/2$, which means that $x \notin U$, as desired. □

3.5 Quantifying equitability via statistical power 

Theorem 3.5 gives us an alternative to measuring equitability via lengths of interpretable
intervals. Instead, for every $x_0 \in [0,1)$ and for every $x_1 > x_0$, we can use many samples of
size $n$ to estimate the power of right-tailed tests based on $\hat{\varphi}$ at distinguishing $H_0 : \Phi = x_0$
from $H_1 : \Phi = x_1$. This process is illustrated schematically in Figure 6. In that figure, good
equitability corresponds to high power on pairs $(x_1, x_0)$ even when $x_1 - x_0$ is small.

3.6 Discussion 

In this section, we gave a characterization of equitability in terms of statistical power with
respect to a family of null hypotheses corresponding to different relationship strengths. (See
Theorem 3.5.) This characterization shows what the concept of equitability/interpretability is
fundamentally about: being able to distinguish not just signal ($\Phi > 0$) from no signal ($\Phi = 0$)
but also stronger signal ($\Phi = x_1$) from weaker signal ($\Phi = x_0$), and being able to do so
across relationships of different types. This indeed makes sense when a data set contains an
overwhelming number of heterogeneous relationships that exhibit, say, $\Phi(\mathcal{Z}) = 0.3$ and that we
would like to ignore because they are not as interesting as the small number of relationships
with, say, $\Phi(\mathcal{Z}) = 0.8$.

Let us now explore how the power requirement into which equitability translates differs from
the conventional lens through which measures of dependence are analyzed. We do so by returning
once more to the case in which $\mathcal{Q}$ is a set of noisy functional relationships and the property
of interest is $R^2$. In this setting, the conventional way to assess a measure of dependence would
be through analysis of its power with respect to a null hypothesis of independence and with
a simple alternative hypothesis. Such an analysis would consider, say, right-tailed tests based
on the statistic $\hat{\varphi}$ and evaluate their power at rejecting the null hypothesis of $R^2 = 0$, i.e.
statistical independence, first on linear relationships with varying noise levels, then separately
on exponential relationships with varying noise levels, and so on.

Figure 6: A schematic illustration of the visualization of equitability via statistical power. (Top) A
depiction of the sampling distributions of a test statistic $\hat{\varphi}$ when a data set contains only four
relationships: a parabolic and a linear relationship with $\Phi = 0.3$, and a parabolic and a linear relationship with
$\Phi = 0.6$. The dashed line represents the critical value of the most permissive level-$\alpha$ right-tailed test of
$H_0 : \Phi = 0.3$. (Bottom left) The power function of the most permissive level-$\alpha$ right-tailed test based
on a statistic $\hat{\varphi}$ of the null hypothesis $H_0 : \Phi = 0.3$. The curve shows the power of the test as a function
of $x_1$, the value of $\Phi$ that defines the alternative hypothesis. (Bottom middle) The power function
can be depicted instead as a heat map. (Bottom right) Instead of considering just one null hypothesis,
we can consider a set of null hypotheses (with corresponding critical values) of the form $H_0 : \Phi = x_0$
and plot each of the resulting power curves as a heat map. The result is a plot in which the intensity
of the color in the coordinate $(x_1, x_0)$ corresponds to the power of the size-$\alpha$ right-tailed test based on
$\hat{\varphi}$ at distinguishing $H_1 : \Phi = x_1$ from $H_0 : \Phi = x_0$. A statistic is $1/d$-equitable with confidence $1 - 2\alpha$
if this power surface attains the value $1 - \alpha$ within distance $d$ of the diagonal along each row. In other
words, the redder the triangle appears, the higher the equitability of $\hat{\varphi}$.



In contrast, our result shows that for $\hat{\varphi}$ to be $1/d$-equitable, it must yield right-tailed
tests with high power at distinguishing null hypotheses of the form $R^2 \leq x_0$ from alternative
hypotheses of the form $R^2 \geq x_1$ for any $x_1 > x_0 + d$. This is more stringent than the conventional
analysis described above for the following three reasons.

1. Instead of just one null hypothesis (i.e., $x_0 = 0$), there are many possible values of $x_0$
corresponding to different $R^2$ values.

2. Each of the new null hypotheses can be composite since $\mathcal{Q}$ can contain relationships of
many different types (e.g. noisy linear, noisy sinusoidal, and noisy parabolic). Whereas
for many measures of dependence all of these relationships may have reduced to a single
null hypothesis of statistical independence in the case of $R^2 = 0$, they yield composite
null hypotheses once we allow $R^2$ to be non-zero.

3. The alternative hypotheses here are also composite, since each one similarly consists of
several different relationship types with the same $R^2$. Whereas conventional analysis of
power against independence considers only one alternative at a time, here we require that
tests simultaneously have good power on sets of alternatives with the same $R^2$.

This understanding of equitability is both good news and bad news. On the one hand, it 
provides us with a concrete sense of the relationship of equitability to power against independence,
which has been the more traditional way of evaluating measures of dependence. In so
doing, it also makes clear the motivation behind equitability and the cases in which it is useful. 
On the other hand, however, the understanding that equitability corresponds to power against 
a much larger set of null hypotheses suggests, via “no free lunch”-type considerations, that if 
we want to achieve higher power against this larger set of null hypotheses, we may need to 
give up some power against independence. And indeed, in [16] we demonstrate empirically that 
such a trade-off does seem to exist for several measures of dependence. 

However, there are situations in which it may be desirable to give up some power against 
independence in exchange for a degree of equitability. For instance, recall the analysis [14] of the 
gene expression data set discussed earlier in this paper. In that analysis, not only did several 
measures of dependence each detect thousands of significant relationships after correction for 
multiple hypothesis testing, but there was also an overlap of over 85% among the relationships 
detected by the five best-performing methods. In data exploration scenarios such as this one, 
in which existing measures of dependence reliably identify so many relationships, focusing
on additional gains in power against independence appears to be a lower priority than
deciding how to choose among the large number of relationships already detected.

4 Equitability implies low detection threshold 

The primary motivation given for equitability is that often data sets contain so many relationships
that we are not interested in all deviations from independence but rather only in
the strongest few relationships. However, there are also many data sets in which, due to low 
sample size, multiple-testing considerations, or relative lack of structure in the data, very few 
relationships pass significance. Alternatively, there are also settings in which equitability is too 
ambitious even at large sample sizes. In such settings, we may indeed be interested in simply 
detecting deviations from independence rather than ranking them by strength. 

In this situation, there is still cause for concern about the effect on our results of our choice
of test statistic $\hat{\varphi}$. For instance, it is easy to imagine that, despite asymptotic guarantees, an
independence test will suffer from low power even on strong relationships of a certain type at a
finite sample size $n$ because the test statistic systematically assigns lower scores to relationships
of that type. To avoid this, we might want a guarantee that, at a sample size of $n$, the test has
a given amount of power in detecting relationships whose strength as measured by $\Phi$ is above
a certain threshold, across a broad range of relationship types. This would ensure that, even if
we cannot rank relationships by strength, we at least will not miss important relationships as
a result of the statistic we use.

In this section we show a straightforward connection between equitability as defined above
and this desideratum, which we call low detection threshold. In particular, we show via the
alternate characterization of equitability proven in the previous section that low detection
threshold is a straightforward consequence of high equitability. Since the converse does not hold,
low detection threshold may be a reasonable criterion to use in situations in which equitability
is too much to ask.

Given a set $\mathcal{Q}$ of standard relationships, and a property of interest $\Phi$, we define low detection
threshold as follows.

Definition 4.1. A statistic $\hat{\varphi}$ has a $(1 - \beta)$-detection threshold of $d$ at level $\alpha$ with respect to
$\Phi$ on $\mathcal{Q}$ if there exists a level-$\alpha$ right-tailed test based on $\hat{\varphi}$ of the null hypothesis $H_0 : \Phi(\mathcal{Z}) = 0$
whose power on $H_1 : \mathcal{Z}$ at a sample size of $n$ is at least $1 - \beta$ for all $\mathcal{Z} \in \mathcal{Q}$ with $\Phi(\mathcal{Z}) \geq d$.
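Given worst-case power estimates against independence on a grid of relationship strengths, a detection threshold in the sense of Definition 4.1 can be read off directly. The sketch below is only an illustration: the power values are invented, the function name `detection_threshold` is hypothetical, and the answer is limited to the resolution of the grid.

```python
def detection_threshold(strengths, worst_case_power, beta):
    """Smallest strength d on the grid such that power >= 1 - beta at every strength >= d."""
    threshold = None
    # Walk from the strongest relationships downward and stop at the first failure.
    for d, k in sorted(zip(strengths, worst_case_power), reverse=True):
        if k < 1.0 - beta:
            break
        threshold = d
    return threshold

# Invented worst-case power of a level-alpha test of Phi = 0 at strengths 0.0, 0.1, ..., 1.0.
strengths = [i / 10 for i in range(11)]
power = [0.05, 0.2, 0.5, 0.8, 0.93, 0.96, 0.99, 1.0, 1.0, 1.0, 1.0]

d = detection_threshold(strengths, power, beta=0.05)   # 0.5 for these values
```

The scan from the top is deliberately conservative: if the power dips below $1 - \beta$ at some strong relationship, no smaller strength is reported as the threshold, and `None` is returned when even the strongest relationships fail.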

The connection between equitability and low detection threshold is then a straightforward 
corollary of Theorem 3.5. 

Corollary 4.2. Fix some $0 < \alpha < 1$, let $\hat{\varphi}$ be worst-case $1/d$-interpretable with respect to $\Phi$ on
$\mathcal{Q}$ with confidence $1 - 2\alpha$, and assume that $\max R_{2\alpha}(\cdot)$ is a strictly increasing function. Then
$\hat{\varphi}$ has a $(1 - \alpha)$-detection threshold of $d$ at level $\alpha$ with respect to $\Phi$ on $\mathcal{Q}$.

Assume that $\Phi$ has the property that it is zero precisely in cases of statistical independence.
Then the above corollary says that equitability and interpretability, to the extent they can
be achieved, make strong guarantees about power against independence on $\mathcal{Q}$. On the other
hand, it is easy to see that low detection threshold need not imply equitability. Therefore,
minimal power against independence is a strictly weaker criterion than equitability.

The connection between equitability and detection threshold with respect to $\Phi$ is important
because there exist situations in which equitability may be difficult to achieve but in which we
still want some sort of guarantee about the robustness of our power against independence to
changes in relationship type. This general theme of not missing relationships because of their
type is the intuitive heart of equitability, and the above corollary shows how this conception
might be utilized in other ways.

Another way that low detection threshold arises naturally is if we pre-filter our data set
using some independence test before conducting a more fine-grained analysis with a second
statistic. In that case, low detection threshold ensures that we will not “throw out” important
relationships prematurely just because of their relationship type. In our companion paper [16],
we propose precisely such a scheme, and we analyze the detection threshold of the preliminary
test in question to argue that the scheme will perform well.

5 Quantifying equitability in practice 

Having defined equitability and seen how it can be interpreted in terms of power, we now
consider the equitability on a set of noisy functional relationships of some commonly used
methods: the maximal information coefficient as estimated by $\mathrm{MIC}_e$ [4], distance correlation
[5, 24, 27], and mutual information [21, 22] as estimated using the Kraskov estimator [6].

In this analysis, we use $R^2$ as our property of interest, $n = 500$ as our sample size, and

$$\mathcal{Q} = \left\{ \left(x + \varepsilon_\sigma,\ f(x) + \varepsilon'_\sigma\right) : x \in X_f,\ \varepsilon_\sigma, \varepsilon'_\sigma \sim \mathcal{N}(0, \sigma^2),\ f \in F,\ \sigma \in \mathbb{R}_{\geq 0} \right\}$$

where $\varepsilon_\sigma$ and $\varepsilon'_\sigma$ are i.i.d., $F$ is the set of functions in Appendix A, and $X_f$ is the set of $n$
$x$-values that result in the points $(x_i, f(x_i))$ being equally spaced along the graph of $f$.
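One way to generate a sample from a relationship of this form is sketched below. It is not the code used for the analysis: the domain $[0, 1]$, the polyline approximation of arc length, and the function names are assumptions made for illustration, and noise is added to both coordinates as the i.i.d. remark above suggests.

```python
import math
import random

def equally_spaced_along_graph(f, n, lo=0.0, hi=1.0, resolution=10000):
    """Return n x-values in [lo, hi] whose points (x, f(x)) are evenly spaced in arc length."""
    # Approximate the graph by a fine polyline and accumulate its length.
    xs = [lo + (hi - lo) * i / resolution for i in range(resolution + 1)]
    cum = [0.0]
    for a, b in zip(xs, xs[1:]):
        cum.append(cum[-1] + math.hypot(b - a, f(b) - f(a)))
    # Invert the cumulative length at n evenly spaced targets by linear interpolation.
    out, j = [], 0
    for i in range(n):
        target = cum[-1] * i / (n - 1)
        while j < resolution - 1 and cum[j + 1] < target:
            j += 1
        seg = cum[j + 1] - cum[j]
        t = (target - cum[j]) / seg if seg > 0 else 0.0
        out.append(xs[j] + t * (xs[j + 1] - xs[j]))
    return out

def sample_noisy_relationship(f, n, sigma, rng):
    """Add independent N(0, sigma^2) noise to both coordinates of the evenly spaced points."""
    xs = equally_spaced_along_graph(f, n)
    return [(x + rng.gauss(0.0, sigma), f(x) + rng.gauss(0.0, sigma)) for x in xs]

# Example: 500 noisy points from a parabola (an arbitrary choice of f and sigma).
points = sample_noisy_relationship(lambda x: 4 * (x - 0.5) ** 2, 500, 0.1, random.Random(0))
```

Spacing points by arc length rather than by $x$ places more of the sample on steep parts of the graph; for a straight line the two spacings coincide.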

The results of the analysis are shown in Figure 7. The figure visualizes the analysis via
both interpretable intervals and statistical power. By Theorem 3.5, these two viewpoints are

Figure 7: An analysis of the equitability with respect to $R^2$ of three measures of dependence on
a set of functional relationships. The set of relationships used is described in Section 5. Each column
contains results for the indicated measure of dependence. (Top) The analysis visualized via interpretable
intervals as in Figure 2. [Narrower is more equitable.] The worst-case and average-case widths of the 0.1-
interpretable intervals for the statistic in question are indicated. (Bottom) The same analysis visualized
via statistical power as in Figure 6. [Redder is more equitable.] The average power across all pairs of
null and alternative hypotheses is computed for each plot. For a legend describing which functional
relationships were analyzed and which parameters were used for each method, see Appendix A.


equivalent, and they are both shown here in order to help the reader build intuition for this
equivalence. For instance, the worst-case 0.1-interpretability of $\mathrm{MIC}_e$ here is 2.92, because the
widest interpretable interval has width $1/2.92 = 0.342$. And indeed, $\mathrm{MIC}_e$ yields right-tailed tests with
$1 - 0.1/2 = 95\%$ power at distinguishing any null hypothesis of the form $H_0 : R^2(\mathcal{Z}) = x_0$ from
any alternative hypothesis of the form $H_1 : R^2(\mathcal{Z}) = x_1$ provided $x_1 - x_0 > 1/2.92 = 0.342$.

As the figure demonstrates, the equitability of 2.92 achieved by $\mathrm{MIC}_e$ on this $\mathcal{Q}$ is the
highest among the methods examined. In contrast, the equitabilities with respect to $R^2$ of
distance correlation and mutual information estimation on this $\mathcal{Q}$ are 1 and 1.04, respectively.
For a more extensive analysis that varies the sample size as well as noise model and marginal
distributions, and compares many more methods, see [16].

6 Conclusion 

Informally, given some measure $\Phi$ of relationship strength, the equitability of a measure of
dependence $\hat{\varphi}$ with respect to $\Phi$ is the degree to which $\hat{\varphi}$ allows us to draw inferences about
relationship strength across a broad set of relationship types. We give here a conceptual framework
to motivate equitability and then discuss the contributions of this work.

6.0.1 The motivation for equitability 

There are two different ways to motivate equitability. The first is to begin with a measure
of dependence $\hat{\varphi}$ and to observe that, though $\hat{\varphi}$ will asymptotically allow us to detect all
deviations from independence in a data set, it need not tell us anything about the strength of
those relationships. Since it often happens that we detect many more relationships than can
be realistically followed up, it would be desirable to have $\hat{\varphi}$ tell us something not just about
the presence or absence of a relationship, but also about relationship strength as defined by $\Phi$
on at least a partial set of “standard relationships” $\mathcal{Q}$.

The second way is to suppose that $\hat{\varphi}$ is a consistent estimator of $\Phi$ on $\mathcal{Q}$ and to ask “what
is the minimal requirement we can add to ensure that $\hat{\varphi}$ is robust to detecting relationships
outside of $\mathcal{Q}$?” Perhaps the weakest stipulation we can impose is that the population value $\varphi$
of our statistic be non-zero in cases of non-trivial dependence of any sort. That is, we want $\hat{\varphi}$
to be a measure of dependence as well.

Both of these scenarios would be resolved by a measure of dependence that is also a consistent
estimator of $\Phi$. However, in many interesting cases there is no known statistic satisfying
both properties: for instance, if $\mathcal{Q}$ is a set of noisy functional relationships and $\Phi$ is $R^2$, then
on the one hand computing the sample $R^2$ with respect to a non-parametric estimate of the
generating function will be a consistent estimator of $\Phi$, but will give a score of 0 to a circle.
And on the other hand, no measure of dependence is known also to be a consistent estimator
of $R^2$ on noisy functional relationships.

This naturally leads us to wonder whether, despite the difficulty of simultaneously estimating
$\Phi$ consistently and retaining the properties of a measure of dependence, we can at least
seek an approximate version of this ideal. Doing so, however, requires a weaker requirement
than consistent estimation. This is what leads us to equitability. Equitability allows us to seek
statistics that have the robustness of measures of dependence but that also, via their relationship
to a property of interest $\Phi$, give values that have a clear, if approximate, interpretation
and can therefore be used to rank relationships.

6.0.2 Contributions of this work 

In this paper, we formalized and developed the theory of equitability in three ways. We first 
defined the equitability of a statistic φ̂ on Q with respect to Φ as the extent to which φ̂ gives 
us good interval estimates of Φ on Q. Our definition rests on an object called the interpretable 
interval, which has coverage guarantees with respect to Φ. We define φ̂ to be equitable if all of 
its interpretable intervals are small. 

Second, we showed that this formalization of equitability can be equivalently stated in 
terms of power against a specific set of null hypotheses corresponding to different relationship 
strengths. That is, while measures of dependence have conventionally been judged by their 
power at distinguishing non-trivial signal from statistical independence, equitability is equivalent 
to the stronger property of being able to distinguish different degrees of possibly non-trivial 
signal strength from each other. 

Third, we defined a concept called low detection threshold, which stipulates that, at a fixed 
sample size, a statistic yield independence tests with a guaranteed minimal power to detect 
relationships whose strength passes a certain threshold, across a range of relationship types. 
We showed that low detection threshold is a straightforward consequence of equitability. Since 
the converse does not hold, low detection threshold is a natural weaker criterion that one could 
aim for when equitability proves difficult to achieve. 
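
To make the notion of power at a fixed sample size concrete, the following Python sketch estimates by simulation the power of an independence test as a function of relationship strength. It is a simplified illustration under stated assumptions: it uses the absolute Pearson correlation as the statistic and a single relationship type (noisy lines with Gaussian noise), whereas the property described above requires the power guarantee across a range of relationship types. The names abs_pearson, noisy_line, and estimate_power, as well as the sample size, level, and simulation counts, are hypothetical.

```python
import numpy as np


def abs_pearson(x, y):
    """Absolute sample Pearson correlation, used here as the test statistic."""
    return abs(np.corrcoef(x, y)[0, 1])


def noisy_line(rng, n, r2):
    """Draw n points from a noisy line whose population R^2 equals r2."""
    x = rng.normal(size=n)
    eps = rng.normal(size=n)
    y = np.sqrt(r2) * x + np.sqrt(1.0 - r2) * eps
    return x, y


def estimate_power(r2, n=50, level=0.05, n_sims=2000, seed=0):
    """Monte Carlo power of the independence test at strength r2 and sample size n."""
    rng = np.random.default_rng(seed)
    # Null distribution of the statistic: x and y drawn independently.
    null = np.array(
        [abs_pearson(rng.normal(size=n), rng.normal(size=n)) for _ in range(n_sims)]
    )
    # Reject independence when the statistic exceeds the (1 - level) null quantile.
    critical = np.quantile(null, 1.0 - level)
    alt = np.array([abs_pearson(*noisy_line(rng, n, r2)) for _ in range(n_sims)])
    return float(np.mean(alt > critical))


# Power as a function of relationship strength at fixed n = 50.
for r2 in (0.0, 0.05, 0.1, 0.2, 0.3):
    print(f"R^2 = {r2:.2f}  estimated power = {estimate_power(r2):.3f}")
```

In this setup, a detection threshold at power 1 − β would correspond to the smallest strength on the grid whose estimated power is at least 1 − β, here for one relationship type only.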

Our formalization and its results serve three primary purposes. The first is to provide a 
framework for rigorous discussion and exploration of equitability and related concepts. The 
second is to situate equitability in the context of interval estimation and hypothesis testing and 
to clarify its relationship to central concepts in those areas such as confidence and statistical 
power. The third is to show that equitability and the language developed around it can help 
us to both formulate and achieve other useful desiderata for measures of dependence. 

These connections provide a framework for thinking about the utility of both current and 
future measures of dependence for exploratory data analysis. Power against independence, the 
lens through which measures of dependence are currently evaluated, is appropriate in many 
settings in which very few significant relationships are expected, or in which we want to know 
whether one specific relationship is non-trivial or not. However, in situations in which most 
measures of dependence already identify a large number of relationships, a rigorous theory of 
equitability will allow us to begin to assess when we can glean more information from a given 
measure of dependence than just the binary result of an independence test. 

Of course, there is much left to understand about equitability. For instance, to what extent 
is it achievable for different properties of interest? What are natural and useful properties 
of interest for sets Q besides noisy functional relationships? For common statistics such as 
MIC [1] or MICe [4], can we obtain a theoretical characterization of the sets Q for which good 
equitability with respect to R² is achieved? Are there systematic ways of obtaining equitable 
behavior via a learning framework as was done for causation in [28]? These questions all deserve 
attention. 

Equitability as framed here is certainly not the only goal toward which we should strive in 
developing new measures of dependence. As data sets not only grow in size but also become 
more varied, there will undoubtedly develop new and interesting use-cases for measures of 
dependence, each with its own way of assessing success. Notwithstanding which particular 
modes of assessment are used, it is important that we formulate and explore concepts that move 
beyond power against independence, at least in the bivariate setting. Equitability provides one 
approach to coping with the changing nature of data exploration, but more generally, we can 
and should ask more of measures of dependence. 
A Details of analyses 

A.1 Functions analysed in Figures 2 and 7 

Below is the legend showing which function types correspond to the colors in each of Figures 2 
and 7. The functions used are the same as the ones in the equitability analyses of [16]. 


Cosine, High Freq 
Cosine, Non-Fourier Freq [Low] 
Cosine, Varying Freq [Medium] 
Cubic 
Cubic, Y-Stretched 
Exponential [2^x] 
Line 
Linear+Periodic, High Freq 
Linear+Periodic, High Freq 2 
Linear+Periodic, Low Freq 
Linear+Periodic, Medium Freq 
Parabola 
Sine, High Freq 
Sine, Low Freq 
Sine, Non-Fourier Freq [Low] 
Sine, Varying Freq [Medium] 

The legend for Figures 2 and 7. 


A.2 Parameters used in Figure 7 

In the analysis of the equitability of MICe, distance correlation, and mutual information, the 
following parameter choices were made: for MICe, α = 0.8 and c = 5 were used; for distance 
correlation no parameter is required; and for mutual information estimation via the Kraskov 
estimator, k = 6 was used. The parameters chosen were the ones that maximize overall 
equitability in the detailed analyses performed in [16]. For mutual information, the choice of k = 6 
(out of the parameters tested: k = 1, 6, 10, 20) also maximizes equitability on the specific set 
Q that is analyzed in Figure 7. 